Abstract

We study the problem of adaptive control of a high dimensional linear quadratic
(LQ) system. Previous work established the asymptotic convergence to an optimal
controller for various adaptive control schemes. More recently, for the average
cost LQ problem, a regret bound of $\tilde{O}(\sqrt{T})$ was shown, apart from logarithmic
factors. However, this bound scales exponentially with $p$, the dimension of the
state space. In this work we consider the case where the matrices describing the
dynamics of the LQ system are sparse and their dimensions are large. We present
an adaptive control scheme that achieves a regret bound of $\tilde{O}(p\sqrt{T})$, apart from
logarithmic factors. In particular, our algorithm has an average cost of $(1+\epsilon)$
times the optimum cost after $T = \mathrm{polylog}(p)\,O(1/\epsilon^2)$ steps. This is in comparison to
previous work on the dense dynamics where the algorithm requires time that scales
exponentially with dimension in order to achieve regret of $\epsilon$ times the optimal cost.
We believe that our result has prominent applications in the emerging area of
computational advertising, in particular targeted online advertising and advertising
in social networks.



1 Introduction 

In this paper we address the problem of adaptive control of a high dimensional linear quadratic (LQ)
system. Formally, the dynamics of a linear quadratic system are given by

$$
\begin{aligned}
x(t+1) &= A^0 x(t) + B^0 u(t) + w(t+1), \\
c(t) &= x(t)^T Q x(t) + u(t)^T R u(t),
\end{aligned} \tag{1}
$$

where $u(t) \in \mathbb{R}^r$ is the control (action) at time $t$, $x(t) \in \mathbb{R}^p$ is the state at time $t$, $c(t) \in \mathbb{R}$ is
the cost at time $t$, and $\{w(t+1)\}_{t \ge 0}$ is a sequence of random vectors in $\mathbb{R}^p$ with i.i.d. standard
Normal entries. The matrices $Q \in \mathbb{R}^{p \times p}$ and $R \in \mathbb{R}^{r \times r}$ are positive semi-definite (PSD) matrices
that determine the cost at each step. The evolution of the system is described through the matrices
$A^0 \in \mathbb{R}^{p \times p}$ and $B^0 \in \mathbb{R}^{p \times r}$. Finally, by high dimensional system we mean the case where $p, r \gg 1$.
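The recursion in Eq. (1) can be simulated directly. The following is a minimal sketch in Python/NumPy, assuming a fixed linear feedback $u(t) = -Lx(t)$; the helper name `simulate_lq` and the toy matrices are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def simulate_lq(A, B, Q, R, L, T, rng):
    """Roll out x(t+1) = A x(t) + B u(t) + w(t+1) under u(t) = -L x(t).

    Returns the per-step costs c(t) = x(t)^T Q x(t) + u(t)^T R u(t).
    """
    p = A.shape[0]
    x = np.zeros(p)                      # x(0) = 0
    costs = np.empty(T)
    for t in range(T):
        u = -L @ x
        costs[t] = x @ Q @ x + u @ R @ u
        w = rng.standard_normal(p)       # i.i.d. standard Normal noise
        x = A @ x + B @ u + w
    return costs

# Toy (hypothetical) sparse system: p = 4 states, r = 2 controls.
rng = np.random.default_rng(0)
p, r = 4, 2
A = 0.5 * np.eye(p)
B = np.zeros((p, r))
B[0, 0] = B[1, 1] = 1.0
Q, R = np.eye(p), np.eye(r)
L = np.zeros((r, p))                     # no feedback in this toy run
costs = simulate_lq(A, B, Q, R, L, T=2000, rng=rng)
avg_cost = costs.mean()
```

With these toy matrices the stationary state covariance is $(4/3)I$, so the empirical average cost should settle near $4 \cdot 4/3$, which illustrates that the per-step cost is of order $p$ even without any control error.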

A celebrated fundamental theorem in control theory asserts that the above LQ system can be
optimally controlled by a simple linear feedback if the pair $(A^0, B^0)$ is controllable and the pair
$(A^0, Q^{1/2})$ is observable. The optimal controller can be explicitly computed from the matrices
describing the dynamics and the cost. Throughout this paper we assume that the controllability and
observability conditions hold.

When the matrix $\Theta^0 = [A^0, B^0]$ is unknown, the task is that of adaptive control, where the system
is to be learned and controlled at the same time. Early works on the adaptive control of LQ systems
relied on the certainty equivalence principle [2]. In this scheme at each time $t$ the unknown parameter
$\Theta^0$ is estimated based on the observations collected so far and the optimal controller for the
estimated system is applied. Such controllers are shown to converge to an optimal controller in the
case of minimum variance cost; however, in general they may converge to a suboptimal controller
[11]. Subsequently, it has been shown that introducing random exploration by adding noise to the
control signal, e.g., [14], solves the problem of converging to suboptimal estimates.

All the aforementioned works are concerned with the asymptotic convergence of the controller
to an optimal controller. In order to achieve regret bounds, cost-biased parameter estimation [12, 8, 6],
in particular the optimism in the face of uncertainty (OFU) principle [13], has been shown to be
effective. In this method a confidence set $S$ is found such that $\Theta^0 \in S$ with high probability. The
system is then controlled using the most optimistic parameter estimates, i.e., $\widehat{\Theta} \in S$ with the smallest
optimum cost. The asymptotic convergence of the average cost of OFU for the LQR problem was
shown in [6]. This asymptotic result was extended in [1] by providing a bound for the cumulative
regret. Assume $x(0) = 0$ and for a control policy $\pi$ define the average cost

$$
J_\pi = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \mathbb{E}[c_\pi(t)]. \tag{2}
$$

Further, define the cumulative regret as

$$
R(T) = \sum_{t=0}^{T} \big(c_\pi(t) - J_*\big), \tag{3}
$$

where $c_\pi(t)$ is the cost of control policy $\pi$ at time $t$ and $J_* = J(\Theta^0)$ is the optimal average cost.
The algorithm proposed in [1] is shown to have cumulative regret $\tilde{O}(\sqrt{T})$ where $\tilde{O}$ is hiding the
logarithmic factors. While no lower bound was provided for the regret, comparison with the
multi-armed bandit problem, where a lower bound of $\Omega(\sqrt{T})$ was shown for the general case [9], suggests
that this scaling with time for the cumulative regret is optimal.

The focus of [1] was on the scaling of the regret with the time horizon $T$. However, the regret of the
proposed algorithm scales poorly with dimension. More specifically, the analysis in [1] proves a regret
bound of $R(T) < C p^{p+r+2} \sqrt{T}$. The current paper focuses on (many) applications where the state
and control dimensions are much larger than the time horizon of interest. A powerful reinforcement
learning algorithm for these applications should have regret which depends gracefully on dimension.
In general, there is little to be achieved when $T < p$, as the number of degrees of freedom $(pr + p^2)$
is larger than the number of observations $(Tp)$ and any estimator can be arbitrarily inaccurate.
However, when there is prior knowledge about the unknown parameters $A^0, B^0$, e.g., when $A^0, B^0$ are
sparse, accurate estimation can be feasible. In particular, [3] proved that under suitable conditions
the unknown parameters of a noise driven system (i.e., no control) whose dynamics are modeled by
linear stochastic differential equations can be estimated accurately with as few as $O(\log(p))$
samples. However, the result of [3] is not directly applicable here since for a general feedback gain $L$,
even if $A^0$ and $B^0$ are sparse, the closed loop gain $A^0 - B^0 L$ need not be sparse. Furthermore,
system dynamics would be correlated with past observations through the estimated gain matrix $L$.
Finally, there is no notion of cost in [3] while here we have to obtain bounds on cost and its scaling
with $p$. In this work we extend the result of [3] by showing that under suitable conditions,
unknown parameters of sparse high dimensional LQ systems can be accurately estimated with as few
as $O(\log(p + r))$ observations. Equipped with this efficient learning method, we show that sparse
high dimensional LQ systems can be adaptively controlled with regret $\tilde{O}(p\sqrt{T})$.

To put this result in perspective note that even when $x(t) = 0$, the expected cost at time $t + 1$ is
$\Omega(p)$ due to the noise. Therefore, the cumulative cost at time $T$ is lower bounded as $\Omega(pT)$. Comparing
this to our regret bound, we see that for $T = \mathrm{polylog}(p)\,O(1/\epsilon^2)$, the cumulative cost of our algorithm
is bounded by $(1 + \epsilon)$ times the optimum cumulative cost. In other words, our algorithm performs
close to optimal after $\mathrm{polylog}(p)$ steps. This is in contrast with the result of [1] where the algorithm
needs $\Omega(p^{2p})$ steps in order to achieve regret of $\epsilon$ times the optimal cost.

Sparse high dimensional LQ systems appear in many engineering applications. Here we are
particularly motivated by an emerging field of applications in marketing and advertising. The use of
dynamical optimal control models in advertising has a history of at least four decades, cf. [17, 10]
for a survey. In these models, often a partial differential equation is used to describe how advertising
expenditure over time translates into sales. The basic problem is to find the advertising expenditure
that maximizes the net profit. The focus of these works is to model the temporal dynamics of
the advertising expenditure (the control variable) and the variables of interest (sales, goodwill level,
etc.). There also exists a rich literature studying the spatial interdependence of consumers' and
firms' behavior to devise marketing schemes [7]. In these models space can be generalized beyond
geographies to include notions like demographies and psychometry.

The combination of spatial interdependence and temporal dynamics models for optimal advertising was
also considered in [16, 15]. A simple temporal dynamics model is extended in [15] by allowing state
and control variables to have spatial dependence and introducing a diffusive component in the
controlled PDE which describes the spatial dynamics. The controlled PDE is then shown to be
equivalent to an abstract linear control system of the form

$$
\frac{dx(t)}{dt} = A x(t) + B u(t). \tag{4}
$$

Both [15] and [16] are concerned with the optimal control, and the interactions are either dictated
by the model or assumed known. Our work deals with a discrete and noisy version of (4) where
the dynamics are to be estimated but are known to be sparse. In the model considered in [15] the
state variable $x$ lives in an infinite dimensional space. Spatial models in marketing [7] usually
consider state variables which have a large number of dimensions, e.g., the number of zip codes in the
US ($\approx$ 50K). High dimensional state space and control is a recurring theme in these applications.

In particular, with modern social networks customers are classified in a highly granular way,
potentially with each customer representing his own class. With the number of classes and complexity
of their interactions, it is unlikely that we could formulate an effective model a priori for how classes
interact. Further, the nature of these interactions changes over time with the changing landscape of
Internet services and information available to customers. This makes it important to efficiently learn
from real-time data about the nature of these interactions.

Notation: We bundle the unknown parameters into one variable $\Theta^0 = [A^0, B^0] \in \mathbb{R}^{p \times q}$ where
$q = p + r$ and call it the interaction matrix. For $v \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n}$ and $p \ge 1$, we denote by $\|v\|_p$
the standard $p$-norm and by $\|M\|_p$ the corresponding operator norm. For $1 \le i \le m$, $M_i$ represents
the $i$-th row of matrix $M$. For $S \subseteq [m]$, $J \subseteq [n]$, $M_{SJ}$ is the submatrix of $M$ formed by the rows in
$S$ and columns in $J$. For a set $S$ denote by $|S|$ its cardinality. For an integer $n$ denote by $[n]$ the set
$\{1, \dots, n\}$.



2 Algorithm 

Our algorithm employs the Optimism in the Face of Uncertainty (OFU) principle in an episodic
fashion. At the beginning of episode $i$ the algorithm constructs a confidence set $\Omega^{(i)}$ which is
guaranteed to include the unknown parameter $\Theta^0$ with high probability. The algorithm then chooses
$\widetilde{\Theta}^{(i)} \in \Omega^{(i)}$ that has the smallest expected cost as the estimated parameter for episode $i$ and applies
the optimal control for the estimated parameter during episode $i$.

The confidence set is constructed using observations from the last episode only, but the lengths of the
episodes are chosen to increase geometrically, allowing for more accurate estimates and shrinkage
of the confidence set by a constant factor at each episode. The details of each step and the pseudo
code for the algorithm follow.
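The geometric growth of the episodes can be made concrete with a short sketch. It assumes the schedule stated in the algorithm table, $\Delta\tau_0 = n_0$ and $\Delta\tau_i = 4^i (1 + i/\log(q/\delta))\, n_1$ for $i \ge 1$; the function name `episode_lengths` and the numeric inputs are hypothetical, and rounding up to an integer number of steps is an added assumption.

```python
import math

def episode_lengths(n0, n1, q, delta, num_episodes):
    """Episode lengths: Delta tau_0 = n0, Delta tau_i = 4^i (1 + i/log(q/delta)) n1."""
    lengths = [math.ceil(n0)]
    for i in range(1, num_episodes):
        growth = 4**i * (1 + i / math.log(q / delta))
        lengths.append(math.ceil(growth * n1))   # round up to whole time steps
    return lengths

# Hypothetical values of n0, n1, q and delta, chosen only for illustration.
lengths = episode_lengths(n0=50, n1=10, q=100, delta=0.01, num_episodes=5)
```

Because every episode is at least four times longer than $4^{i-1} n_1$, the number of policy updates up to time $T$ grows only logarithmically in $T$, which is the property the regret analysis relies on.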

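Constructing confidence set: Define $\tau_i$ to be the start of episode $i$ with $\tau_0 = 0$. Let $L^{(i)}$ be the
controller that has been chosen for episode $i$. For $t \in [\tau_i, \tau_{i+1})$ the system is controlled by $u(t) =
-L^{(i)} x(t)$ and the system dynamics can be written as $x(t+1) = (A^0 - B^0 L^{(i)}) x(t) + w(t+1)$. At
the beginning of episode $i + 1$, first an initial estimate $\widehat{\Theta}$ is obtained by solving the following convex
optimization problem for each row $\Theta_u \in \mathbb{R}^q$ separately:

$$
\widehat{\Theta}_u^{(i+1)} \in \arg\min_{\Theta_u} \; \mathcal{L}(\Theta_u) + \lambda \|\Theta_u\|_1, \tag{5}
$$

where

$$
\mathcal{L}(\Theta_u) = \frac{1}{2\Delta\tau_{i+1}} \sum_{t=\tau_i}^{\tau_{i+1}-2} \big\{ x_u(t+1) - \Theta_u \widetilde{L}^{(i)} x(t) \big\}^2, \qquad \Delta\tau_{i+1} = \tau_{i+1} - \tau_i, \tag{6}
$$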
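Algorithm: Reinforcement learning algorithm for LQ systems.

Input: Precision $\epsilon$, failure probability $4\delta$, initial $(\rho, C_{\min}, \alpha)$ identifiable controller $L^{(0)}$, $\ell(\Theta^0, \epsilon)$
Output: Series of estimates $\widetilde{\Theta}^{(i)}$, confidence sets $\Omega^{(i)}$ and controllers $L^{(i)}$
1: Let $\ell_0 = \max(1, \max_{j \in [r]} \|L^{(0)}_j\|_2)$, and

$$
n_0 = \frac{4 \cdot 10^3\, k^2 \ell_0^2}{\alpha^2 (1-\rho) C_{\min}^2} \Big( \frac{1}{\epsilon^2} + \frac{k}{(1-\rho)^2} \Big) \log\Big(\frac{4kq}{\delta}\Big),
$$

$$
n_1 = \frac{4 \cdot 10^3\, k^2 \ell(\Theta^0, \epsilon)^2}{\alpha^2 (1-\rho) C_{\min}^2} \Big( \frac{1}{\epsilon^2} + \frac{k}{(1-\rho)^2} \Big) \log\Big(\frac{4kq}{\delta}\Big).
$$

   Let $\Delta\tau_0 = n_0$, $\Delta\tau_i = 4^i (1 + i/\log(q/\delta))\, n_1$ for $i \ge 1$, and $\tau_i = \sum_{j=0}^{i} \Delta\tau_j$.
2: for $i = 0, 1, 2, \dots$ do
3:    Apply the control $u(t) = -L^{(i)} x(t)$ until $\tau_{i+1} - 1$ and observe the trace $\{x(t)\}_{\tau_i \le t < \tau_{i+1}}$.
4:    Calculate the estimate $\widehat{\Theta}^{(i+1)}$ from (5) and construct the confidence set $\Omega^{(i+1)}$.
5:    Calculate $\widetilde{\Theta}^{(i+1)}$ from (9) and set $L^{(i+1)} \leftarrow L(\widetilde{\Theta}^{(i+1)})$.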



and $\widetilde{L}^{(i)} = [I, -L^{(i)T}]^T$. The estimator $\widehat{\Theta}_u$ is known as the LASSO estimator. The first term
in the cost function is the normalized negative log likelihood, which measures the fidelity to the
observations, while the second term imposes the sparsity constraint on $\Theta_u$. $\lambda$ is the regularization
parameter.
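The row-wise problem (5) is an ordinary $\ell_1$-regularized least squares problem, so any LASSO solver applies. The sketch below uses proximal gradient descent (ISTA), which is one standard choice and not necessarily the solver used by the authors. The names `soft_threshold` and `lasso_row` are hypothetical, and the data are synthetic: the rows of `Z` stand in for the regressors $(\widetilde{L}^{(i)} x(t))^T$ and `y` for the responses $x_u(t+1)$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_row(Z, y, lam, n_iter=500):
    """Minimise (1/(2n)) ||y - Z theta||_2^2 + lam ||theta||_1 with ISTA."""
    n, q = Z.shape
    step = n / np.linalg.norm(Z, 2) ** 2     # 1 / Lipschitz constant of the gradient
    theta = np.zeros(q)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ theta - y) / n     # gradient of the quadratic term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Synthetic check: a 2-sparse row observed through noisy linear measurements.
rng = np.random.default_rng(0)
n, q = 200, 10
theta_true = np.zeros(q)
theta_true[0], theta_true[1] = 1.0, -1.0
Z = rng.standard_normal((n, q))
y = Z @ theta_true + 0.1 * rng.standard_normal(n)
theta_hat = lasso_row(Z, y, lam=0.1)
```

The estimate is shrunk towards zero by roughly $\lambda$ on the support, which is why the analysis ties the admissible $\lambda$ to the target precision $\epsilon$.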

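For $\Theta^{(1)}, \Theta^{(2)} \in \mathbb{R}^{p \times q}$ define the distance $d(\Theta^{(1)}, \Theta^{(2)})$ as

$$
d(\Theta^{(1)}, \Theta^{(2)}) = \max_{u \in [p]} \|\Theta^{(1)}_u - \Theta^{(2)}_u\|_2, \tag{7}
$$

where $\Theta_u$ is the $u$-th row of the matrix $\Theta$. It is worth noting that for $k$-sparse matrices with $k$
constant, this distance does not scale with $p$ or $q$. In particular, if the absolute values of the elements
of $\Theta^{(1)}$ and $\Theta^{(2)}$ are bounded by $\Theta_{\max}$ then $d(\Theta^{(1)}, \Theta^{(2)}) \le 2\sqrt{k}\,\Theta_{\max}$.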
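As a quick numerical illustration of the distance (7) and of the dimension-free bound $2\sqrt{k}\,\Theta_{\max}$, consider the following sketch. The helper names `row_distance` and `random_k_sparse` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def row_distance(theta1, theta2):
    """d(Theta1, Theta2): largest Euclidean norm of a row of Theta1 - Theta2."""
    return float(np.max(np.linalg.norm(theta1 - theta2, axis=1)))

def random_k_sparse(rng, p, q, k, theta_max):
    """Matrix whose rows each have k nonzero entries bounded by theta_max."""
    theta = np.zeros((p, q))
    for u in range(p):
        support = rng.choice(q, size=k, replace=False)
        theta[u, support] = rng.uniform(-theta_max, theta_max, size=k)
    return theta

rng = np.random.default_rng(1)
p, q, k, theta_max = 6, 9, 3, 2.0
d = row_distance(random_k_sparse(rng, p, q, k, theta_max),
                 random_k_sparse(rng, p, q, k, theta_max))
bound = 2 * np.sqrt(k) * theta_max
```

The bound holds because the difference of two $k$-sparse rows has at most $2k$ nonzero entries, and its squared norm is at most $4k\,\Theta_{\max}^2$ whether the supports overlap or not.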

Having the estimator $\widehat{\Theta}^{(i)}$, the algorithm constructs the confidence set for episode $i$ as

$$
\Omega^{(i)} = \{ \Theta \in \mathbb{R}^{p \times q} \mid d(\Theta, \widehat{\Theta}^{(i)}) \le 2^{-i} \epsilon \}, \tag{8}
$$

where $\epsilon > 0$ is an input parameter to the algorithm. For any fixed $\delta > 0$, by choosing $\tau_i$ judiciously
we ensure that with probability at least $1 - \delta$, $\Theta^0 \in \Omega^{(i)}$ for all $i \ge 1$ (see Theorem 3.2).

Design of the controller: Let $J(\Theta)$ be the minimum expected cost if the interaction matrix is
$\Theta = [A, B]$ and denote by $L(\Theta)$ the optimal controller that achieves the expected cost $J(\Theta)$. The
algorithm implements the OFU principle by choosing, at the beginning of episode $i$, an estimate
$\widetilde{\Theta}^{(i)} \in \Omega^{(i)}$ such that

$$
\widetilde{\Theta}^{(i)} \in \arg\min_{\Theta \in \Omega^{(i)}} J(\Theta). \tag{9}
$$

The optimal control corresponding to $\widetilde{\Theta}^{(i)}$ is then applied during episode $i$, i.e., $u(t) =
-L(\widetilde{\Theta}^{(i)}) x(t)$ for $t \in [\tau_i, \tau_{i+1})$. Recall that for $\Theta = [A, B]$, the optimal controller is given through
the following relations

$$
\begin{aligned}
K(\Theta) &= Q + A^T K(\Theta) A - A^T K(\Theta) B \big(B^T K(\Theta) B + R\big)^{-1} B^T K(\Theta) A, \quad \text{(Riccati equation)} \\
L(\Theta) &= \big(B^T K(\Theta) B + R\big)^{-1} B^T K(\Theta) A.
\end{aligned}
$$

The pseudo code for the algorithm is summarized in the table.
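The Riccati relations above can be solved numerically by iterating the recursion to a fixed point. The following sketch assumes that $B^T K B + R$ stays invertible along the iteration and that the controllability and observability conditions hold; the function name `riccati_gain`, the iteration budget and the scalar test system are illustrative assumptions.

```python
import numpy as np

def riccati_gain(A, B, Q, R, n_iter=1000, tol=1e-10):
    """Iterate K <- Q + A'KA - A'KB (B'KB + R)^{-1} B'KA and return (K, L)."""
    K = Q.copy()
    for _ in range(n_iter):
        BtK = B.T @ K
        G = np.linalg.solve(BtK @ B + R, BtK @ A)   # (B'KB + R)^{-1} B'KA
        K_next = Q + A.T @ K @ A - A.T @ K @ B @ G
        converged = np.max(np.abs(K_next - K)) < tol
        K = K_next
        if converged:
            break
    L = np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
    return K, L

# Scalar toy system: the fixed point solves K^2 - K - 1 = 0.
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
K, L = riccati_gain(A, B, Q, R)
closed_loop = A - B @ L
```

For the unit-covariance noise of Eq. (1), the standard identity $J(\Theta) = \mathrm{tr}(K(\Theta))$ then gives the optimal average cost, which is the quantity minimized in (9).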

3 Main Results 

In this section we present performance guarantees in terms of cumulative regret and learning
accuracy for the presented algorithm. In order to state the theorems, we first need to present some
assumptions on the system.
Given $\Theta \in \mathbb{R}^{p \times q}$ and $L \in \mathbb{R}^{r \times p}$, define $\widetilde{L} = [I, -L^T]^T \in \mathbb{R}^{q \times p}$ and let $\Lambda \in \mathbb{R}^{p \times p}$ be a solution to
the following Lyapunov equation

$$
\Lambda - \Theta \widetilde{L} \Lambda \widetilde{L}^T \Theta^T = I. \tag{10}
$$

If the closed loop system $(A^0 - B^0 L)$ is stable then the solution to the above equation exists and the
state vector $x(t)$ has a Normal stationary distribution with covariance $\Lambda$.
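Since $\Theta \widetilde{L} = A - BL$, Eq. (10) reads $\Lambda = I + M \Lambda M^T$ with $M = A - BL$, and for a stable $M$ it can be solved by iterating this map. The sketch below does exactly that; the function name `stationary_covariance` and the toy matrices are assumptions for illustration.

```python
import numpy as np

def stationary_covariance(A, B, L, n_iter=10_000, tol=1e-12):
    """Solve Lambda - M Lambda M^T = I with M = A - B L by fixed-point iteration."""
    M = A - B @ L
    p = M.shape[0]
    Lam = np.eye(p)
    for _ in range(n_iter):
        Lam_next = np.eye(p) + M @ Lam @ M.T
        if np.max(np.abs(Lam_next - Lam)) < tol:
            return Lam_next
        Lam = Lam_next
    return Lam

# Toy stable closed loop (hypothetical values).
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0], [0.0]])
L = np.array([[0.1, 0.0]])
Lam = stationary_covariance(A, B, L)
M = A - B @ L
residual = np.max(np.abs(Lam - M @ Lam @ M.T - np.eye(2)))
```

The iteration converges geometrically when $\|A - BL\|_2 < 1$, which is condition (1) of the identifiability definition that follows.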

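We proceed by introducing an identifiable regulator.

Definition 3.1. For a $k$-sparse matrix $\Theta^0 = [A^0, B^0] \in \mathbb{R}^{p \times q}$ and $L \in \mathbb{R}^{r \times p}$, define $\widetilde{L} =
[I, -L^T]^T \in \mathbb{R}^{q \times p}$ and let $H = \widetilde{L} \Lambda \widetilde{L}^T$ where $\Lambda$ is the solution of Eq. (10) with $\Theta = \Theta^0$.
Define $L$ to be $(\rho, C_{\min}, \alpha)$ identifiable (with respect to $\Theta^0$) if it satisfies the following conditions for
all $S \subseteq [q]$, $|S| \le k$:

(1) $\|A^0 - B^0 L\|_2 \le \rho < 1$, (2) $\lambda_{\min}(H_{SS}) \ge C_{\min}$, (3) $\|H_{S^c S} H_{SS}^{-1}\|_\infty \le 1 - \alpha$.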

The first condition simply states that if the system is controlled using the regulator $L$ then the closed
loop autonomous system is asymptotically stable. The second and third conditions are similar to
what is referred to in the sparse signal recovery literature as the mutual incoherence or
irrepresentable conditions. Various examples and results exist for the matrix families that satisfy these
conditions [18]. Let $S$ be the set of indices of the nonzero entries in a specific row of $\Theta^0$. The
second condition states that the corresponding entries in the extended state variable $y = [x^T, u^T]^T$ are
sufficiently distinguishable from each other. In other words, if the trajectories corresponding to this
group of state variables are observed, none of them can be well approximated as a linear combination
of the others. The third condition can be thought of as a quantification of the first vs. higher order
dependencies. Consider entry $j$ in the extended state variable. Then, the dynamics of $y_j$ are directly
influenced by the entries $y_S$. However, they are also influenced indirectly by other entries of $y$. The third
condition roughly states that the indirect influences are sufficiently weaker than the direct influences.
There exists a vast literature on the applicability of these conditions and scenarios in which they are
known to hold. These conditions are almost necessary for the successful recovery by $\ell_1$ relaxation.
For a discussion on these and other similar conditions imposed for sparse signal recovery we refer
the reader to [19] and [20] and the references therein.
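Conditions (2) and (3) of Definition 3.1 are easy to evaluate numerically for a given matrix $H$ and index set $S$. The sketch below computes the two margins; the function name `identifiability_margins` and the small matrix `H` are hypothetical stand-ins (in the paper's setting $H$ would be $\widetilde{L} \Lambda \widetilde{L}^T$, and the check would be repeated over all $S$ with $|S| \le k$).

```python
import numpy as np

def identifiability_margins(H, S):
    """Return (lambda_min(H_SS), ||H_{S^c S} H_SS^{-1}||_inf) for index set S."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(H.shape[0]), S)      # complement of S
    H_SS = H[np.ix_(S, S)]
    H_ScS = H[np.ix_(Sc, S)]
    lam_min = float(np.linalg.eigvalsh(H_SS).min())
    # Operator infinity norm = maximum absolute row sum.
    incoherence = float(np.linalg.norm(H_ScS @ np.linalg.inv(H_SS), np.inf))
    return lam_min, incoherence

# Toy symmetric matrix standing in for H.
H = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
lam_min, incoherence = identifiability_margins(H, S=[0])
```

For this set $S$ the conditions hold with any $C_{\min} \le$ `lam_min` and any $\alpha \le 1 -$ `incoherence`.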

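Define $\Theta_{\min} = \min_{i \in [p],\, j \in [q],\, \Theta^0_{ij} \neq 0} |\Theta^0_{ij}|$. Our first result states that the system can be learned
efficiently from its trajectory observations when it is controlled by an identifiable regulator.

Theorem 3.2. Consider the LQ system of Eq. (1) and assume $\Theta^0 = [A^0, B^0]$ is $k$-sparse. Let
$u(t) = -L x(t)$ where $L$ is a $(\rho, C_{\min}, \alpha)$ identifiable regulator with respect to $\Theta^0$ and define
$\ell = \max(1, \max_{j \in [r]} \|L_j\|_2)$. Let $n$ denote the number of samples of the trajectory that is observed.
For any $0 < \epsilon < \min(\Theta_{\min}, \ell/2, 3/(1-\rho))$, there exists $\lambda$ such that, if

$$
n \ge \frac{4 \cdot 10^3\, k^2 \ell^2}{\alpha^2 (1-\rho) C_{\min}^2} \Big( \frac{1}{\epsilon^2} + \frac{k}{(1-\rho)^2} \Big) \log\Big(\frac{4kq}{\delta}\Big), \tag{11}
$$

then the $\ell_1$-regularized least squares solution $\widehat{\Theta}$ of Eq. (5) satisfies $d(\widehat{\Theta}, \Theta^0) \le \epsilon$ with probability
larger than $1 - \delta$. In particular, this is achieved by taking $\lambda = 6\ell \sqrt{\log(4q/\delta) / (n \alpha^2 (1 - \rho))}$.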

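Our second result states that equipped with an efficient learning algorithm, the LQ system of Eq. (1)
can be controlled with regret $\tilde{O}(p \sqrt{T} \log^{3/2}(1/\delta))$ under suitable assumptions.

Define an $\epsilon$-neighborhood of $\Theta^0$ as $\mathcal{N}_\epsilon(\Theta^0) = \{\Theta \in \mathbb{R}^{p \times q} \mid d(\Theta^0, \Theta) \le \epsilon\}$. Our assumption asserts
the identifiability of $L(\Theta)$ for $\Theta$ close to $\Theta^0$.

Assumption: There exist $\epsilon, C > 0$ such that for all $\Theta \in \mathcal{N}_\epsilon(\Theta^0)$, $L(\Theta)$ is identifiable w.r.t. $\Theta^0$ and

$$
\sigma_L(\Theta^0, \epsilon) = \sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \|L(\Theta)\|_2 \le C, \qquad \sigma_K(\Theta^0, \epsilon) = \sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \|K(\Theta)\|_2 \le C.
$$

Also define

$$
\ell(\Theta^0, \epsilon) = \sup_{\Theta \in \mathcal{N}_\epsilon(\Theta^0)} \max\big(1, \max_{j \in [r]} \|L_j(\Theta)\|_2\big).
$$

Note that $\ell(\Theta^0, \epsilon) \le \max(C, 1)$, since $\max_{j \in [r]} \|L_j(\Theta)\|_2 \le \|L(\Theta)\|_2$.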
Theorem 3.3. Consider the LQ system of Eq. (1). For some constants $\epsilon$, $C_{\min}$ and $0 < \alpha, \rho < 1$,
assume that an initial $(\rho, C_{\min}, \alpha)$ identifiable regulator $L^{(0)}$ is given. Further, assume that for any
$\Theta \in \mathcal{N}_\epsilon(\Theta^0)$, $L(\Theta)$ is $(\rho, C_{\min}, \alpha)$ identifiable. Then, with probability at least $1 - \delta$ the cumulative
regret of ALGORITHM (cf. the table) is bounded as

$$
R(T) \le \tilde{O}\big(p \sqrt{T} \log^{3/2}(1/\delta)\big), \tag{12}
$$

where $\tilde{O}$ is hiding the logarithmic factors.



4 Analysis 

4.1 Proof of Theorem 3.2

To prove Theorem 3.2 we first state a set of sufficient conditions for the solution of the $\ell_1$-regularized
least squares to be within some distance, as defined by $d(\cdot, \cdot)$, of the true parameter. Subsequently,
we prove that these conditions hold with high probability.

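Define $X = [x(0), x(1), \dots, x(n-1)] \in \mathbb{R}^{p \times n}$ and let $W = [w(1), \dots, w(n)] \in \mathbb{R}^{p \times n}$ be the
matrix containing the Gaussian noise realization. Further let $W_u$ denote the $u$-th row of $W$.

Define the normalized gradient and Hessian of the likelihood function (6) as

$$
\widehat{G} = -\nabla \mathcal{L}(\Theta^0_u) = \frac{1}{n} \widetilde{L} X W_u^T, \qquad \widehat{H} = \nabla^2 \mathcal{L}(\Theta^0_u) = \frac{1}{n} \widetilde{L} X X^T \widetilde{L}^T. \tag{13}
$$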

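The following proposition, a proof of which can be found in [20], provides a set of sufficient
conditions for the accuracy of the $\ell_1$-regularized least squares solution.

Proposition 4.1. Let $S$ be the support of $\Theta^0_u$ with $|S| \le k$, and $H$ be defined per Definition 3.1.
Assume there exist $0 < \alpha < 1$ and $C_{\min} > 0$ such that

$$
\lambda_{\min}(H_{SS}) \ge C_{\min}, \qquad \|H_{S^c S} H_{SS}^{-1}\|_\infty \le 1 - \alpha. \tag{14}
$$

For any $0 < \epsilon < \Theta_{\min}$, if the following conditions hold

$$
\|\widehat{G}\|_\infty \le \frac{\lambda \alpha}{3}, \qquad \|\widehat{G}_S\|_\infty \le \frac{\epsilon C_{\min}}{4k} - \lambda, \tag{15}
$$

$$
\|\widehat{H}_{S^c S} - H_{S^c S}\|_\infty \le \frac{\alpha C_{\min}}{12 \sqrt{k}}, \qquad \|\widehat{H}_{SS} - H_{SS}\|_\infty \le \frac{\alpha C_{\min}}{12 \sqrt{k}}, \tag{16}
$$

the $\ell_1$-regularized least squares solution (5) satisfies $d(\widehat{\Theta}_u, \Theta^0_u) \le \epsilon$.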



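In the sequel, we prove that the conditions in Proposition 4.1 hold with high probability given that the
assumptions of Theorem 3.2 are satisfied. A few lemmas are in order, proofs of which are deferred
to the Appendix.

The first lemma states that $\widehat{G}$ concentrates in infinity norm around its mean of zero.

Lemma 4.2. Assume $\rho = \|A^0 - B^0 L\|_2 < 1$ and let $\ell = \max(1, \max_{j \in [r]} \|L_j\|_2)$. Then, for any
$S \subseteq [q]$ and $0 < \epsilon < \ell/2$,

$$
\mathbb{P}\{\|\widehat{G}_S\|_\infty > \epsilon\} \le 2|S| \exp\Big(-\frac{n (1-\rho) \epsilon^2}{4 \ell^2}\Big). \tag{17}
$$

To prove the conditions in Eq. (16) we first bound in the following lemma the absolute deviations
of the elements of $\widehat{H}$ from their mean $H$, i.e., $|\widehat{H}_{ij} - H_{ij}|$.

Lemma 4.3. Let $i, j \in [q]$, $\rho = \|A^0 - B^0 L\|_2 < 1$, and $0 < \epsilon < 3/(1-\rho) < n$. Then,

$$
\mathbb{P}(|\widehat{H}_{ij} - H_{ij}| > \epsilon) \le 2 \exp\Big(-\frac{n (1-\rho)^3 \epsilon^2}{24 \ell^2}\Big). \tag{18}
$$

The following corollary of Lemma 4.3 bounds $\|\widehat{H}_{JS} - H_{JS}\|_\infty$ for $J, S \subseteq [q]$.
Corollary 4.4. Let $J, S \subseteq [q]$, $\rho = \|A^0 - B^0 L\|_2 < 1$, $\epsilon < 3|S|/(1-\rho)$ and $n > 3/(1-\rho)$. Then,

$$
\mathbb{P}(\|\widehat{H}_{JS} - H_{JS}\|_\infty > \epsilon) \le 2 |J||S| \exp\Big(-\frac{n (1-\rho)^3 \epsilon^2}{24 |S|^2 \ell^2}\Big). \tag{19}
$$

The proof of Corollary 4.4 is by applying the union bound as

$$
\mathbb{P}(\|\widehat{H}_{JS} - H_{JS}\|_\infty > \epsilon) \le |J||S| \max_{i \in J,\, j \in S} \mathbb{P}(|\widehat{H}_{ij} - H_{ij}| > \epsilon/|S|). \tag{20}
$$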



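Proof of Theorem 3.2. We show that the conditions given by Proposition 4.1 hold. The conditions
in Eq. (14) are true by the assumption of identifiability of $L$ with respect to $\Theta^0$. In order to make the
first constraint on $\widehat{G}$ imply the second constraint on $\widehat{G}$, we assume that $\lambda \alpha / 3 \le \epsilon C_{\min}/(4k) - \lambda$,
which is ensured to hold if $\lambda \le \epsilon C_{\min}/(6k)$. By Lemma 4.2, $\mathbb{P}(\|\widehat{G}\|_\infty > \lambda \alpha / 3) \le \delta/2$ if

$$
\lambda^2 = \frac{36 \ell^2}{n (1-\rho) \alpha^2} \log\Big(\frac{4q}{\delta}\Big). \tag{21}
$$

Requiring $\lambda \le \epsilon C_{\min}/(6k)$, we obtain

$$
n \ge \frac{36^2\, k^2 \ell^2}{\epsilon^2 \alpha^2 C_{\min}^2 (1-\rho)} \log\Big(\frac{4q}{\delta}\Big). \tag{22}
$$

The conditions on $\widehat{H}$ can also be aggregated as $\|\widehat{H}_{[q],S} - H_{[q],S}\|_\infty \le \alpha C_{\min}/(12\sqrt{k})$. By Corollary
4.4, $\mathbb{P}(\|\widehat{H}_{[q],S} - H_{[q],S}\|_\infty > \alpha C_{\min}/(12\sqrt{k})) \le \delta/2$ if

$$
n \ge \frac{3456\, k^3 \ell^2}{\alpha^2 (1-\rho)^3 C_{\min}^2} \log\Big(\frac{4kq}{\delta}\Big). \tag{23}
$$

Merging the conditions in Eq. (22) and (23) we conclude that the conditions in Proposition 4.1 hold
with probability at least $1 - \delta$ if

$$
n \ge \frac{4 \cdot 10^3\, k^2 \ell^2}{\alpha^2 (1-\rho) C_{\min}^2} \Big( \frac{1}{\epsilon^2} + \frac{k}{(1-\rho)^2} \Big) \log\Big(\frac{4kq}{\delta}\Big),
$$

which finishes the proof of Theorem 3.2. □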
4.2 Proof of Theorem 3.3

The high-level idea of the proof is similar to the proof of the main theorem in [1]. First, we give a
decomposition for the gap between the cost obtained by the algorithm and the optimal cost. We then
upper bound each term of the decomposition separately.

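4.2.1 Cost Decomposition

Writing the Bellman optimality equations [5, 4] for average cost dynamic programming, we get

$$
J(\widetilde{\Theta}_t) + x(t)^T K(\widetilde{\Theta}_t) x(t) = \min_u \Big\{ x(t)^T Q x(t) + u^T R u + \mathbb{E}\big[ z(t+1)^T K(\widetilde{\Theta}_t) z(t+1) \,\big|\, \mathcal{F}_t \big] \Big\},
$$

where $\widetilde{\Theta}_t = [\widetilde{A}_t, \widetilde{B}_t]$ is the estimate used at time $t$, $z(t+1) = \widetilde{A}_t x(t) + \widetilde{B}_t u + w(t+1)$, and $\mathcal{F}_t$
is the $\sigma$-field generated by the variables $\{(z_\tau, x_\tau)\}_{\tau=0}^{t}$. Notice that the left-hand side is the average
cost incurred with initial state $x(t)$ [5, 4]. Therefore,

$$
\begin{aligned}
J(\widetilde{\Theta}_t) + x(t)^T K(\widetilde{\Theta}_t) x(t)
&= x(t)^T Q x(t) + u(t)^T R u(t) \\
&\quad + \mathbb{E}\big[ (\widetilde{A}_t x(t) + \widetilde{B}_t u(t) + w(t+1))^T K(\widetilde{\Theta}_t) (\widetilde{A}_t x(t) + \widetilde{B}_t u(t) + w(t+1)) \,\big|\, \mathcal{F}_t \big] \\
&= x(t)^T Q x(t) + u(t)^T R u(t) + \mathbb{E}\big[ (\widetilde{A}_t x(t) + \widetilde{B}_t u(t))^T K(\widetilde{\Theta}_t) (\widetilde{A}_t x(t) + \widetilde{B}_t u(t)) \,\big|\, \mathcal{F}_t \big] \\
&\quad + \mathbb{E}\big[ w(t+1)^T K(\widetilde{\Theta}_t) w(t+1) \,\big|\, \mathcal{F}_t \big] \\
&= x(t)^T Q x(t) + u(t)^T R u(t) + \mathbb{E}\big[ x(t+1)^T K(\widetilde{\Theta}_t) x(t+1) \,\big|\, \mathcal{F}_t \big] \\
&\quad + \Big( (\widetilde{A}_t x(t) + \widetilde{B}_t u(t))^T K(\widetilde{\Theta}_t) (\widetilde{A}_t x(t) + \widetilde{B}_t u(t)) \\
&\qquad - (A^0 x(t) + B^0 u(t))^T K(\widetilde{\Theta}_t) (A^0 x(t) + B^0 u(t)) \Big).
\end{aligned}
$$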
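Consequently

$$
\sum_{t=0}^{T} \big( x(t)^T Q x(t) + u(t)^T R u(t) \big) = \sum_{t=0}^{T} J(\widetilde{\Theta}_t) + C_1 + C_2 + C_3, \tag{25}
$$

where

$$
C_1 = \sum_{t=0}^{T} \Big\{ x(t)^T K(\widetilde{\Theta}_t) x(t) - \mathbb{E}\big[ x(t+1)^T K(\widetilde{\Theta}_{t+1}) x(t+1) \,\big|\, \mathcal{F}_t \big] \Big\}, \tag{26}
$$

$$
C_2 = -\sum_{t=0}^{T} \mathbb{E}\big[ x(t+1)^T \big( K(\widetilde{\Theta}_t) - K(\widetilde{\Theta}_{t+1}) \big) x(t+1) \,\big|\, \mathcal{F}_t \big], \tag{27}
$$

$$
\begin{aligned}
C_3 = -\sum_{t=0}^{T} \Big( & (\widetilde{A}_t x(t) + \widetilde{B}_t u(t))^T K(\widetilde{\Theta}_t) (\widetilde{A}_t x(t) + \widetilde{B}_t u(t)) \\
& - (A^0 x(t) + B^0 u(t))^T K(\widetilde{\Theta}_t) (A^0 x(t) + B^0 u(t)) \Big).
\end{aligned} \tag{28}
$$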

4.2.2 Good events

We proceed by defining the following two events in the probability space under which we can bound
the terms $C_1, C_2, C_3$. We then provide a lower bound on the probability of these events.

$$
\mathcal{E}_1 = \{\Theta^0 \in \Omega^{(i)}, \text{ for } i \ge 1\}, \qquad \mathcal{E}_2 = \{\|w(t)\| \le 2\sqrt{p \log(T/\delta)}, \text{ for } 1 \le t \le T+1\}.
$$
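4.2.3 Technical lemmas

The following lemmas establish upper bounds on $C_1, C_2, C_3$.

Lemma 4.5. Under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, the following holds with probability at least $1 - \delta$.

$$
C_1 \le \frac{8\sqrt{2}\, C}{(1-\rho)^2} \sqrt{T}\, p \log\Big(\frac{T}{\delta}\Big) \sqrt{\log\Big(\frac{1}{\delta}\Big)}. \tag{29}
$$

Lemma 4.6. Under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, the following holds.

$$
C_2 \le \frac{8C}{(1-\rho)^2}\, p \log\Big(\frac{T}{\delta}\Big) \log T. \tag{30}
$$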
Lemma 4.7. Under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, the following holds with probability at least $1 - \delta$.



,03, <_ 800(^) '.^(: . ^) ■ ■ lo,,?! ,f s-,f ) -..o^rvT. (3i, 

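Lemma 4.8. The following holds true.

$$
\mathbb{P}(\mathcal{E}_1) \ge 1 - \delta, \qquad \mathbb{P}(\mathcal{E}_2) \ge 1 - \delta. \tag{32}
$$

Therefore, $\mathbb{P}(\mathcal{E}_1 \cap \mathcal{E}_2) \ge 1 - 2\delta$.

We are now in position to prove Theorem 3.3.

Proof (Theorem 3.3). Using the cost decomposition (Eq. (25)), under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, we have

$$
\sum_{t=0}^{T} \big( x(t)^T Q x(t) + u(t)^T R u(t) \big) = \sum_{t=0}^{T} J(\widetilde{\Theta}_t) + C_1 + C_2 + C_3 \le T J(\Theta^0) + C_1 + C_2 + C_3,
$$

where the last inequality stems from the choice of $\widetilde{\Theta}_t$ by the algorithm (cf. Eq. (9)) and the fact that
$\Theta^0 \in \Omega_t$ for all $t$ under the event $\mathcal{E}_1$. Hence, $R(T) \le C_1 + C_2 + C_3$. Now using the bounds on
$C_1, C_2, C_3$, we get the desired result. □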
A Proof of technical lemmas

A.1 Proof of Lemma 4.2

As before let $\rho = \|A^0 - B^0 L\|_2$, and $\ell = \max(1, \max_{j \in [r]} \|L_j\|_2)$. Further, for $u \in [p]$,
$j \in [q]$, define $\phi(t) \in \mathbb{R}^{p \times p}$ to have all rows equal to zero except the $u$-th row which is equal
to $\widetilde{L}_j (A^0 - B^0 L)^t$. Define $\widetilde{\Phi}_j \in \mathbb{R}^{np \times np}$ as











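$$
\widetilde{\Phi}_j = \begin{pmatrix}
0 & 0 & \cdots & 0 & 0 \\
\phi(0) & 0 & \cdots & 0 & 0 \\
\phi(1) & \phi(0) & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\phi(n-2) & \phi(n-3) & \cdots & \phi(0) & 0
\end{pmatrix}, \tag{33}
$$

and let

$$
\Phi_j = \frac{1}{2} \big( \widetilde{\Phi}_j + \widetilde{\Phi}_j^T \big). \tag{34}
$$

Lemma A.1. Let $\nu_i$ denote the $i$-th eigenvalue of $\Phi_j$ and assume $\rho < 1$. Then,

$$
\sum_{i=1}^{np} \nu_i = 0, \tag{35}
$$

$$
\max_i |\nu_i| \le \frac{\ell}{1-\rho}, \tag{36}
$$

$$
\sum_{i=1}^{np} \nu_i^2 \le \frac{\ell^2 n}{2(1-\rho)}. \tag{37}
$$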


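We do not prove this lemma here and refer the reader to Lemma A.3 in [3].

Proof (Lemma 4.2). The proof of this lemma follows closely the proof of Proposition 4.2 in [3],
which we provide here for the reader's convenience. Let $w \in \mathbb{R}^{np}$ be the vector obtained by
stacking all the noise vectors up to time $n$, i.e.,

$$
w^T = [w(1)^T, w(2)^T, \dots, w(n)^T].
$$

Then we have that

$$
\widehat{G}_j = \frac{1}{n} \widetilde{L}_j \sum_{t=0}^{n-1} x(t) w_u(t+1) = \frac{1}{n} \sum_{t=0}^{n-1} w_u(t+1) \sum_{\tau=0}^{t-1} \widetilde{L}_j (A^0 - B^0 L)^\tau w(t - \tau) = \frac{1}{n} w^T \Phi_j w,
$$

where $\Phi_j$ is defined in (34). Since $w \sim N(0, I_{np})$ and $\Phi_j$ is symmetric, we can write

$$
\widehat{G}_j = \frac{1}{n} \sum_{i=1}^{np} \nu_i z_i^2, \tag{38}
$$

where $z_i \sim N(0, 1)$ are independent and the $\nu_i$'s are the eigenvalues of the matrix $\Phi_j$.
Now we have for any $\beta > 0$,

$$
\mathbb{P}\Big\{ \frac{1}{n} \sum_{i=1}^{np} \nu_i z_i^2 > \epsilon \Big\} \le e^{-n \beta \epsilon} \prod_{i=1}^{np} \mathbb{E}\big\{ \exp(\beta \nu_i z_i^2) \big\}
= \exp\Big( -n \Big[ \beta \epsilon + \frac{1}{2n} \sum_{i=1}^{np} \log(1 - 2 \nu_i \beta) \Big] \Big).
$$

Let $\beta = (1 - \rho)\epsilon / (2\ell^2)$. Then it follows from Eq. (36) and the assumption $\epsilon < \ell/2$ that $|2 \nu_i \beta| < 1/2$.
Furthermore, for $|x| < 1/2$, $\log(1 - x) \ge -x - x^2$. Hence,

$$
\mathbb{P}\Big\{ \frac{1}{n} \sum_{i=1}^{np} \nu_i z_i^2 > \epsilon \Big\} \le \exp\Big( -n \Big( \beta \epsilon - 2 \beta^2 \frac{1}{n} \sum_{i=1}^{np} \nu_i^2 \Big) \Big) \le \exp\Big( -\frac{n (1-\rho) \epsilon^2}{4 \ell^2} \Big),
$$

where the first inequality follows from the fact that $\sum_{i=1}^{np} \nu_i = 0$ (Eq. (35)) and the second inequality
is obtained using the bound in Eq. (37). Finally, by the union bound we obtain the desired result

$$
\mathbb{P}\{\|\widehat{G}_S\|_\infty > \epsilon\} \le 2|S| \max_{j \in S} \mathbb{P}\{ w^T \Phi_j w > n \epsilon \} \le 2|S| \exp\Big( -\frac{n (1-\rho) \epsilon^2}{4 \ell^2} \Big).
$$

□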



A.2 Proof of Lemma 4.3

Lemma A.2. Let $R_j \in \mathbb{R}^{(n-1)p \times np}$ be obtained by removing the first $p$ rows of $\widetilde{\Phi}_j$. For $i, j \in [q]$
define $R(i, j) = \frac{1}{2}(R_i^T R_j + R_j^T R_i) \in \mathbb{R}^{np \times np}$. Assume $\rho < 1$ and let $\nu_l$ denote the $l$-th eigenvalue
of $R(i, j)$. Then,

l^d < (39) 
1^2 2£W 3 1 \ 



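Proof (Lemma 4.3). Our proof of Lemma 4.3 closely follows the proof of Proposition 4.2 in [3].

Note that $\widehat{H}_{ij}$ can be written as

$$
\begin{aligned}
\widehat{H}_{ij} &= \frac{1}{n} \sum_{t=1}^{n-1} \widetilde{L}_i x(t) x(t)^T \widetilde{L}_j^T \\
&= \frac{1}{n} \sum_{t=1}^{n-1} \widetilde{L}_i \Big( \sum_{\tau=0}^{t-1} (A^0 - B^0 L)^\tau w(t - \tau) \Big) \Big( \sum_{\tau=0}^{t-1} (A^0 - B^0 L)^\tau w(t - \tau) \Big)^T \widetilde{L}_j^T \\
&= \frac{1}{n} \sum_{t=1}^{n-1} \Big( \sum_{\tau=0}^{t-1} \widetilde{L}_i (A^0 - B^0 L)^\tau w(t - \tau) \Big) \Big( \sum_{\tau=0}^{t-1} w(t - \tau)^T \big( (A^0 - B^0 L)^\tau \big)^T \widetilde{L}_j^T \Big) \\
&= \frac{1}{n} w^T R(i, j) w.
\end{aligned}
$$

Since $w \sim N(0, I_{np})$ and $R(i, j)$ is symmetric, we can write

$$
\widehat{H}_{ij} = \frac{1}{n} \sum_{l=1}^{np} \nu_l z_l^2, \tag{41}
$$

where $z_l \sim N(0, 1)$ are independent and the $\nu_l$'s are the eigenvalues of the matrix $R(i, j)$. Further,

$$
\widehat{H}_{ij} - \mathbb{E}(\widehat{H}_{ij}) = \frac{1}{n} \sum_{l=1}^{np} \nu_l (z_l^2 - 1). \tag{42}
$$

Hence, using the Chernoff bound we get

$$
\mathbb{P}\big( \widehat{H}_{ij} - \mathbb{E}(\widehat{H}_{ij}) > \epsilon \big) = \mathbb{P}\Big( \sum_{l=1}^{np} \nu_l (z_l^2 - 1) > \epsilon n \Big)
\le \exp(-\beta \epsilon n) \exp\Big( -\frac{1}{2} \sum_{l=1}^{np} \log(1 - 2 \beta \nu_l) - \beta \sum_{l=1}^{np} \nu_l \Big).
$$

By Lemma A.2, for $n > 3/(1-\rho)$ we have

$$
\frac{1}{n} \sum_{l=1}^{np} \nu_l^2 \le \frac{3 \ell^2}{(1-\rho)^3}. \tag{43}
$$

Let $\beta = (1-\rho)^3 \epsilon / (12 \ell^2)$. By the assumption $\epsilon < 3/(1-\rho)$ we have $|2 \beta \nu_l| < 1/2$. Using the inequality
$\log(1 - x) \ge -x - x^2$ for $|x| < 1/2$, we obtain

$$
\mathbb{P}\big( \widehat{H}_{ij} - \mathbb{E}(\widehat{H}_{ij}) > \epsilon \big) \le \exp\Big( -\beta \epsilon n + 2 \beta^2 \sum_{l=1}^{np} \nu_l^2 \Big) \le \exp\Big( -\frac{n (1-\rho)^3 \epsilon^2}{24 \ell^2} \Big),
$$

which finishes the proof. □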
A.3 Proof of Lemma 4.5

Before embarking on the proof, we state and prove the following claim which will be repeatedly
used in the proofs of Lemmas 4.5, 4.6 and 4.7.

Proposition A.3. Under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, the following holds true.

$$
\|x(t)\| \le \frac{2}{1-\rho} \sqrt{p \log\Big(\frac{T}{\delta}\Big)}, \quad \text{for } 1 \le t \le T+1.
$$

Proof (Proposition A.3). Conditioning on the event $\mathcal{E}_1$, $\Theta^0 \in \Omega^{(i)}$ for $i \ge 1$. Furthermore, for all
$i \ge 1$, $\Omega^{(i)} \subseteq \mathcal{N}_\epsilon(\Theta^0)$. Recall our assumption that for all $\Theta \in \mathcal{N}_\epsilon(\Theta^0)$, $L(\Theta)$ is identifiable with
respect to $\Theta^0$. Consequently, we have $\|A^0 - B^0 L_t\|_2 \le \rho$ for all $t \ge 1$, where $L_t$ denotes the
controller (used by ALGORITHM) at time $t$. Now, we write for $1 \le t \le T+1$,

$$
\|x(t)\| = \Big\| \sum_{t_1=1}^{t} \prod_{t_2=t_1+1}^{t} (A^0 - B^0 L_{t_2})\, w(t_1) \Big\| \le \sum_{t_1=1}^{t} \rho^{\,t - t_1} \|w(t_1)\| \le \frac{2}{1-\rho} \sqrt{p \log\Big(\frac{T}{\delta}\Big)},
$$

where the second inequality holds since we are conditioning on $\mathcal{E}_2$. □

Armed with this proposition, we prove Lemma 4.5.
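Define $z(t) = A^0 x(t) + B^0 u(t)$, and $K_t = K(\widetilde{\Theta}_t)$ for all $t \ge 0$. Since $x(0) = 0$, we have

$$
\begin{aligned}
C_1 &= \sum_{t=0}^{T} \Big( x(t)^T K_t x(t) - \mathbb{E}\big[ x(t+1)^T K_{t+1} x(t+1) \,\big|\, \mathcal{F}_t \big] \Big) \\
&= -\mathbb{E}\big[ x(T+1)^T K_{T+1} x(T+1) \,\big|\, \mathcal{F}_T \big] + \sum_{t=1}^{T} \Big( x(t)^T K_t x(t) - \mathbb{E}\big[ x(t)^T K_t x(t) \,\big|\, \mathcal{F}_{t-1} \big] \Big).
\end{aligned}
$$

Because $K_{T+1}$ is PSD, the first term is bounded above by zero. To bound the second term, define

$$
\mathcal{E}_1^t = \{\Theta^0 \in \Omega_\tau, \text{ for } 1 \le \tau \le t\}, \qquad \mathcal{E}_2^t = \{\|w(\tau)\| \le 2\sqrt{p \log(T/\delta)}, \text{ for } 1 \le \tau \le t\}.
$$

Note that $\mathcal{E}_1 \subseteq \mathcal{E}_1^t$ and $\mathcal{E}_2 \subseteq \mathcal{E}_2^t$. Following the approach of [1], it can be shown that, under the
event $\mathcal{E}_1 \cap \mathcal{E}_2$, the second term is bounded above by $M_T$, where the martingale $\{M_\tau\}$ is defined as

$$
M_\tau = \sum_{t=1}^{\tau} \mathbb{1}\{\mathcal{E}_1^{t-1} \cap \mathcal{E}_2^{t-1}\} \Big( x(t)^T K_t x(t) - \mathbb{E}\big[ x(t)^T K_t x(t) \,\big|\, \mathcal{F}_{t-1} \big] \Big), \qquad M_0 = 0.
$$

Note that $M_\tau$ is a martingale since $\mathcal{E}_1^{t-1}$ and $\mathcal{E}_2^{t-1}$ are $\mathcal{F}_{t-1}$ measurable. In addition,

$$
|M_\tau - M_{\tau-1}| \le \mathbb{1}\{\mathcal{E}_1^{\tau-1} \cap \mathcal{E}_2^{\tau-1}\} \Big| x(\tau)^T K_\tau x(\tau) - \mathbb{E}\big[ x(\tau)^T K_\tau x(\tau) \,\big|\, \mathcal{F}_{\tau-1} \big] \Big| \le \frac{8C}{(1-\rho)^2}\, p \log\Big(\frac{T}{\delta}\Big),
$$

where the last inequality follows from Proposition A.3 and the bound $\|K_\tau\|_2 \le C$. Applying Azuma's inequality,

$$
\mathbb{P}(M_T - M_0 \ge \beta) \le \exp\Big( -\frac{(1-\rho)^4 \beta^2}{128\, T C^2 p^2 \log^2(T/\delta)} \Big).
$$

Hence, with probability at least $1 - \delta$,

$$
C_1 \le \frac{8\sqrt{2}\, C}{(1-\rho)^2} \sqrt{T}\, p \log\Big(\frac{T}{\delta}\Big) \sqrt{\log\Big(\frac{1}{\delta}\Big)}.
$$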

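B Proof of Lemma 4.6

If the confidence set is not updated at time $t + 1$, i.e., $\Omega_t = \Omega_{t+1}$, then $K(\widetilde{\Theta}_t) = K(\widetilde{\Theta}_{t+1})$ and the
$t$-th term in the summation is zero. By the way ALGORITHM chooses the lengths of the episodes, $\Delta\tau_i$,
the number of updates (number of times ALGORITHM changes the policy) is at most $\log_4 T$ up to
time $T$. Using the bound $\|K(\widetilde{\Theta}_t)\|_2 \le C$ for $t \ge 1$, we have

$$
\begin{aligned}
C_2 &= -\sum_{t=0}^{T} \mathbb{E}\big[ x(t+1)^T \big( K(\widetilde{\Theta}_t) - K(\widetilde{\Theta}_{t+1}) \big) x(t+1) \,\big|\, \mathcal{F}_t \big] \\
&\le \sum_{i:\, \tau_i \le T} 2C \|x(\tau_i)\|^2 \\
&\le \frac{8C}{(1-\rho)^2}\, p \log\Big(\frac{T}{\delta}\Big) \log T,
\end{aligned} \tag{45}
$$

where we used Proposition A.3 in the last step.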



C Proof of Lemma 4.7

Let $y_t = [x_t^T, u_t^T]^T \in \mathbb{R}^{q}$. We first establish the following proposition.

Proposition C.1. Under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, the following holds true with probability at least $1 - \delta$:

T 

~ in T 

^ 11(9° - Qt)ytr < -— ^p62log(-)(logT)2ni , (46) 



t=o 

where ni is defined in ALGORITHM . 
Proof (Proposition C.1). Write

f]||(e"-e,)2/dP= E "E'll(0"-e.)y.ll 



t=0 i:Ti<T t=ri 

Ti + l-l 



Ti + l-l 



< E E 2||(A°-B"LW-At+BtL(*))(A"-B°LW)*-^'+ix(r, -1)1 

i:ri<T t=Ti 



;<T t = Ti tl=Ti 



t 



where we used 

x{t) = (A" - B"L^''>y-^'+^x{n - 1) + E ~ 

We proceed by bounding the first term as follows: 

^.+1-1 

E E 2||(A0-B"LW-A+BtL«)(A"-B0L«)*-"'+ix(r, -1)112 

i:Ti<T t=Ti 

< E E 2d(e",eoV'(*~^'+')|k(r.-i)f (47) 

T 

-7T^^^°s(|)Ed(0°.0O^ 

To bound the second term define the matrix

$$
D_t = \big( A^0 - B^0 L^{(i)} - \widetilde{A}_t + \widetilde{B}_t L^{(i)} \big) \big[ I,\ (A^0 - B^0 L^{(i)}),\ (A^0 - B^0 L^{(i)})^2,\ \dots,\ (A^0 - B^0 L^{(i)})^{t - \tau_i} \big]. \tag{48}
$$

The second term can be written as $\sum_{t=0}^{T} 2 \|D_t w_i\|^2$, where $w_i$ is the vector obtained by stacking all
the noise vectors in episode $i$, i.e.,

$$
w_i^T = [w(t)^T, w(t-1)^T, \dots, w(\tau_i)^T].
$$

Hence, the r*'* entry in the vector DtWi is a normal random variable with variance at most 
where Dtr is the r*'* row of matrix Dt and 

,2 ^ d(e°,et) 2 

(l-p2) 

Using standard normal tail bound we get 

P(( A.wO^ > 50 < exp ( - ^^^5* ) . (50) 

Taking 

2 d(e",et)^ ,pT^ 

and applying union bound for re [p], and 1 < i < T, we obtain 

P {{DtrV^if <gt, for \ <t <T,r e [p]) > I - 5 . (52) 
Consequently, with probabihty at least 1 — 5, the second term is bounded by 

T T T 

2E llAw.lP < 2Ep.9. < -^plog(^)Ed(e°,eO'. (53) 



Arlr < ' 2^ ■ (49) 
Finally, using Theorem 3.2 and the choice of episodes, we have

t=0 i:Ti<T i:Ti<T ^^S' S>\S, 



Til 



where we have used the fact that the number of episodes up to time $T$ is at most $\log T$. Combining
Eqs. (47) and (53), we have



E ll(e» - e,)!,.|l' < { iTT^piori j) + |r^pi°g(T>} §.)^ 



□ 



Corollary C.2. Using the value of ni, defined in ALGORITHM , we have with probability at least 

1-5, 

Now, we are ready to bound C3. 

T 



|t^3|<E 



||i^(e)i/2e7y,f - ||if(e)i/2e°"ry,||^ 

T , . 2x 1/2 



< 



5]^||x(e,)i/2e,y,||-||i^(e,)^/^e%|| 

T , ^ ^ _ ^ 2^ 1/2 

E ||if(e,)i/2e,2/,|| + ||x(e,)^/^eVll 

T ^ ^ ^1/2 

<(Y,\\K{Q,Y^'iQ,~Q")y,\\ 

T |- ^ ^ _ ^ 2^ 1/2 

Y,\\\K{e,)'/^e,y,\\ + \\K{ety/'eW\ 



(56) 



/ ^ X 1/2 . T .1/2 

^ t=0 ^ ^ t=0 ^ 

Corollary C.2 provides an upper bound for the first term on the right hand side. In addition,

T T 
t=0 t=Q 

<il + aiLtr)^^^plogij)T (57) 
4(1 + C2) , 



Here, the first inequality follows from Proposition A.3. Combining the bounds for the terms on the
right hand side of Eq. (56), we obtain
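D Proof of Lemma 4.8

We first show that $\mathbb{P}(\mathcal{E}_1) \ge 1 - \delta$. According to Theorem 3.2 the sample complexity scales with
$(1/\epsilon^2) \log(q/\delta)$. Due to the choice of episode lengths in the algorithm, namely $\Delta\tau_i = 4^i (1 +
i/\log(q/\delta))\, n_1$, with probability at least $1 - \delta/2^i$, we have $d(\Theta^0, \widehat{\Theta}^{(i)}) \le 2^{-i} \epsilon$ and thus $\Theta^0 \in \Omega^{(i)}$.
Now by applying the union bound for $i \ge 1$,

$$
\mathbb{P}(\mathcal{E}_1) \ge 1 - \sum_{i=1}^{\infty} \frac{\delta}{2^i} = 1 - \delta. \tag{59}
$$

Next we prove the lower bound for the probability of the event $\mathcal{E}_2$. Let $w(t) \in \mathbb{R}^p$ be the noise vector at
time $t$ with i.i.d. standard normal entries. For any $t \ge 1$ and any $\lambda$, we have

$$
\begin{aligned}
\mathbb{P}\{\|w(t)\|^2 \ge \lambda p\} &= \mathbb{P}\Big\{ e^{\theta \sum_{l=1}^{p} w_l(t)^2} \ge e^{\theta \lambda p} \Big\} \\
&\le e^{-\theta \lambda p} \prod_{l=1}^{p} \mathbb{E}\big[ e^{\theta w_l(t)^2} \big] \\
&= \exp\Big( -p \Big\{ \lambda \theta + \frac{1}{2} \log(1 - 2\theta) \Big\} \Big) \\
&\le \exp\big( -p \{ \lambda \theta - \theta - 2\theta^2 \} \big), \quad \text{for } 0 < \theta < 1,
\end{aligned} \tag{60}
$$

where we used the fact that if $|x| < 1$, then $\log(1 - x) \ge -x - x^2$. Choosing $\theta = 1/2$, and
$\lambda = 4 \log(T/\delta)$, we obtain

$$
\mathbb{P}\{\|w(t)\|^2 \ge 4 p \log(T/\delta)\} \le \exp(-p \log(T/\delta)).
$$

Finally, by applying the union bound for $1 \le t \le T + 1$,

$$
\mathbb{P}(\mathcal{E}_2) = \mathbb{P}\{\|w(t)\| \le 2\sqrt{p \log(T/\delta)}, \text{ for } 1 \le t \le T+1\} \ge 1 - (T+1) \Big( \frac{\delta}{T} \Big)^{p} \ge 1 - \delta. \tag{61}
$$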
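The event $\mathcal{E}_2$ can also be checked empirically. The following sketch draws $T + 1$ standard Normal noise vectors and tests whether all of them satisfy $\|w(t)\| \le 2\sqrt{p \log(T/\delta)}$; the values of $p$, $T$ and $\delta$ are arbitrary choices for illustration, and a single simulated run is only a sanity check, not a substitute for the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, delta = 10, 1000, 0.1                     # illustrative values only
threshold = 2.0 * np.sqrt(p * np.log(T / delta))

w = rng.standard_normal((T + 1, p))             # rows play the role of w(1), ..., w(T+1)
norms = np.linalg.norm(w, axis=1)
event_holds = bool(np.all(norms <= threshold))  # indicator of the event E_2
largest_ratio = float(norms.max() / threshold)
```

In this regime the typical norm is about $\sqrt{p}$, well below the threshold, so the event holds with a wide margin.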